fix: restore Arrow nullable flags lost in Polars round-trips (ENG-375)#130
Conversation
Polars converts all Arrow fields to nullable=True when producing its Arrow output, corrupting schema intent for non-optional columns. Add `arrow_utils.restore_schema_nullability()`, which reinstates the original nullable flag for each field by name using the reference schemas captured before any Polars operation. This is schema-intent-based (unlike `infer_schema_nullable`, which is data-based and would incorrectly mark Optional columns as non-nullable when they happen to have no nulls).

Apply the fix at all Polars round-trip sites:
- FunctionNode: all 5 join sites and the as_table sort
- Join.static_process: the per-iteration inner join loop (removes the previous infer_schema_nullable workaround)

Tests added (red-green-refactor):
- Unit tests for restore_schema_nullability, including the Optional[T] correctness case that infer_schema_nullable fails
- Integration tests for FunctionNode.get_all_records, FunctionNode._iter_all_from_database, and Join.op_forward

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add TestJoinTagColumnNullability with three cases:

1. Shared tag columns with mixed nullable flags — both "id" (nullable=False) and "group" (nullable=True, Optional, no actual nulls) are preserved through the Polars inner join on those shared keys.
2. Non-shared tag columns from each side of a cartesian-product join — "id" (non-nullable) from source1 and "category" (nullable) from source2 both retain their flags in the result.
3. Three-way join (two Polars iterations) — verifies that restore_schema_nullability is applied correctly at each iteration, not just the last.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…creation add_system_tag_columns and add_source_info were appending columns with table.append_column(string_name, array), which Arrow defaults to nullable=True. Those columns are always computed and never null, so nullable=False is the correct schema intent.

This caused a regression in TestJoinOutputSchemaSystemTags after restore_schema_nullability replaced infer_schema_nullable in Join: restore_schema_nullability faithfully preserves nullable=True for these columns (from the reference schema captured before the Polars join), while the operator's output_schema prediction expected nullable=False.

Fix: pass an explicit pa.field(..., nullable=False) to append_column in both functions, so the schema is correct from the moment of creation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR addresses a schema-integrity regression where Polars round-trips (Arrow -> Polars -> Arrow) drop Arrow field nullability intent by outputting all fields as nullable=True. It introduces a schema-intent-based restoration step that re-applies the original nullable flags (and field types/metadata) using reference schemas captured before Polars operations.
Changes:
- Added `arrow_utils.restore_schema_nullability()` to restore Arrow field nullability (and field type/metadata) from one or more reference schemas.
- Updated `FunctionNode` and `Join.static_process` to capture pre-Polars schemas and restore nullability after Polars joins/sorts (replacing the prior data-based `infer_schema_nullable` workaround in `Join`).
- Added unit + integration tests covering round-trip nullability preservation, including the "Optional[T] with no nulls" case.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/orcapod/utils/arrow_utils.py | Adds restore_schema_nullability() and explicitly marks computed/system/source columns as non-nullable. |
| src/orcapod/core/operators/join.py | Restores schema nullability after Polars join; removes infer_schema_nullable cast. |
| src/orcapod/core/nodes/function_node.py | Restores schema nullability after Polars joins/sorts in multiple code paths. |
| tests/test_data/test_polars_nullability/test_restore_schema_nullability.py | Unit tests for restore_schema_nullability, plus demonstrations of the Polars behavior. |
| tests/test_data/test_polars_nullability/test_function_node_nullability.py | Integration tests ensuring FunctionNode + Join preserve nullable intent across joins. |
| tests/test_data/test_polars_nullability/__init__.py | Package marker for the new test module. |
```python
import pyarrow as pa
import polars as pl
import pytest
```
pytest is imported but not used in this test module. Please remove it to avoid unused-import noise (and potential lint failures if ruff/pyflakes is enabled in CI).
```diff
-import pytest
```
```python
assert joined.schema.field("score").nullable is True


class TestRestoreSchemaHullability:
```
Typo in class name: TestRestoreSchemaHullability should be TestRestoreSchemaNullability (or similar) to match the helper being tested.
```diff
-class TestRestoreSchemaHullability:
+class TestRestoreSchemaNullability:
```
```python
import pyarrow as pa
import pytest

import orcapod as op
from orcapod.core.nodes.function_node import FunctionNode
from orcapod.databases import InMemoryArrowDatabase
```
pytest is imported but not used in this module. Please remove it to avoid unused-import noise (and potential lint failures if linting is enabled).
src/orcapod/utils/arrow_utils.py
Outdated
```diff
@@ -1093,7 +1170,11 @@ def add_source_info(
         [f"{source_val}::{col}" for source_val in source_column],
         type=pa.large_string(),
     )
-    table = table.append_column(f"{constants.SOURCE_PREFIX}{col}", source_column)
+    # Source info columns are always computed strings, never null.
```
source_column is overwritten inside the per-column loop, so the source-info token accumulates previous column names (e.g., src::col1::col2) rather than staying src::col2 for the col2 source column. Preserve the original per-row source_info values (e.g., keep a separate base list/array) and build each _source_{col} column from that base.
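The accumulation bug generalizes to any loop that overwrites its own input. Stripped of the Arrow details and using hypothetical values, the pattern and its fix look like this:

```python
# Buggy pattern: the loop variable is overwritten with the derived list,
# so iteration N+1 builds on iteration N's output instead of the original.
source = ["src"] * 2
buggy = {}
for col in ("col1", "col2"):
    source = [f"{v}::{col}" for v in source]  # reuses the derived values
    buggy[col] = source

# Fix: derive every column from a base captured once before the loop.
base_source = ["src"] * 2
fixed = {}
for col in ("col1", "col2"):
    fixed[col] = [f"{v}::{col}" for v in base_source]
```

With the bug, the second column's tokens come out as "src::col1::col2"; with the base captured up front, each column is independently "src::col2".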
```python
if column_config.sort_by_tags:
    output_table_schema = output_table.schema
    output_table = (
        pl.DataFrame(output_table)
        .sort(by=self.keys()[0], descending=False)
        .to_arrow()
    )
    output_table = arrow_utils.restore_schema_nullability(output_table, output_table_schema)
```
This PR restores nullability after Polars .sort(...).to_arrow() in FunctionNode.as_table, but there is another Polars sort round-trip in FunctionPodStream.as_table (src/orcapod/core/function_pod.py, around the column_config.sort_by_tags block) that still drops nullable flags. Consider applying restore_schema_nullability there too for consistent behavior across stream/table materialization paths.
- Remove unused `import pytest` from both test modules
- Rename TestRestoreSchemaHullability → TestRestoreSchemaNullability (typo fix)
- Fix source_column accumulation bug in add_source_info: each _source_<col> column was being built from the array produced for the previous column, causing tokens like "src::col1::col2" instead of "src::col2". Introduce base_source, captured once before the loop, so every column is independently derived from the original per-row source tokens.
- Apply restore_schema_nullability after the Polars sort in FunctionPodStream.as_table (function_pod.py) to match the same fix already applied in FunctionNode.as_table, covering the stream materialization path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Addressed all five review comments in commit 3218c4c:
1 & 3 — Unused import pytest
Removed from both test_restore_schema_nullability.py and test_function_node_nullability.py.
2 — Typo in class name
TestRestoreSchemaHullability → TestRestoreSchemaNullability.
4 — source_column accumulation bug in add_source_info
Real bug — thanks for catching it. The loop was reassigning source_column to the pa.array built for each column, so the next iteration started from already-appended values (e.g. "src::col1::col2" instead of "src::col2"). Fixed by introducing base_source (captured once before the loop) and building each per-column array from that base independently.
5 — Missing restore_schema_nullability in FunctionPodStream.as_table
Applied the same pattern (capture schema before, restore after) to the sort_by_tags Polars sort in function_pod.py, consistent with the equivalent fix in FunctionNode.as_table.
…375) TestAddSourceInfo in test_arrow_utils.py covers four cases:

1. Single column — baseline: _source_x = "src::x".
2. Multi-column — the regression case: each _source_<col> must equal "base::<col>", not "base::prev_col::<col>". With the bug, _source_y would be "base::x::y" because source_column was reused across iterations.
3. Per-row source tokens — same accumulation bug with a list input: _source_b row 0 must be "src0::b", not "src0::a::b".
4. Column count — one _source_<col> per data column, no more.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…G-375) Remove assertion messages and docstring prose that described the buggy output pattern. Tests now simply assert the exact contract — each _source_<col> value is '<source_token>::<col>' — which is sufficient to catch the regression without mentioning what the wrong value would be.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
```diff
 if source_info is None or isinstance(source_info, str):
-    source_column = [source_info] * table.num_rows
+    base_source = [source_info] * table.num_rows
 elif isinstance(source_info, Collection):
     if len(source_info) != table.num_rows:
```
base_source is only assigned for source_info is None/str or when source_info is a Collection; any other runtime type will fall through and later raise an UnboundLocalError when building col_source. Add an explicit else: raise TypeError(...) (or normalize other iterables) so callers get a clear error and the function can’t crash with an unbound local.
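The fix can be sketched with a hypothetical helper mirroring the branching; names are illustrative, not the actual add_source_info code:

```python
from collections.abc import Collection

def normalize_source_info(source_info, num_rows: int) -> list:
    """Hypothetical mirror of add_source_info's input normalization."""
    if source_info is None or isinstance(source_info, str):
        # One shared token (or None) repeated for every row.
        return [source_info] * num_rows
    elif isinstance(source_info, Collection):
        if len(source_info) != num_rows:
            raise ValueError(
                f"expected {num_rows} source values, got {len(source_info)}"
            )
        return list(source_info)
    else:
        # Without this branch, an unsized iterable (e.g. a generator) would
        # fall through and surface later as an UnboundLocalError.
        raise TypeError(
            "source_info must be None, str, or a sized collection, "
            f"not {type(source_info).__name__}"
        )
```

Generators fail the `Collection` check because they lack `__len__`, which is exactly the case the reviewer flagged.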
Fixed in d4bcd72 — added else: raise TypeError(...) after the elif branch so any unsized iterable (e.g. a generator) gets a clear error instead of an UnboundLocalError.
src/orcapod/core/operators/join.py
Outdated
```python
# field names, preserving nullable flags — must be done BEFORE the
# Polars rename call below, which loses all nullability information.
next_ref_schema = pa.schema([
    pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
    for f in next_table.schema
])

if rename_map:
    next_table = pl.DataFrame(next_table).rename(rename_map).to_arrow()
```
When rename_map is non-empty, this does an Arrow→Polars→Arrow round-trip just to rename, and then immediately converts next_table to Polars again for the join. This is an avoidable extra conversion (and can be expensive for large tables). Consider renaming in Arrow (compute new name list and call next_table.rename_columns(...)) or keep a pl.DataFrame(next_table).rename(rename_map) and pass that directly into the join without converting back to Arrow in between.
```diff
-# field names, preserving nullable flags — must be done BEFORE the
-# Polars rename call below, which loses all nullability information.
+# field names, preserving nullable flags before the Polars join,
+# which loses original nullability information.
 next_ref_schema = pa.schema([
     pa.field(rename_map.get(f.name, f.name), f.type, nullable=f.nullable, metadata=f.metadata)
     for f in next_table.schema
 ])
 if rename_map:
-    next_table = pl.DataFrame(next_table).rename(rename_map).to_arrow()
+    next_table = next_table.rename_columns([
+        rename_map.get(name, name) for name in next_table.column_names
+    ])
```
Fixed in d4bcd72 — replaced pl.DataFrame(next_table).rename(rename_map).to_arrow() with Arrow-native next_table.rename_columns([rename_map.get(name, name) for name in next_table.column_names]). Eliminates the round-trip entirely.
```python
"""
Unit tests for the restore_schema_nullability helper in arrow_utils.

RED phase: all tests in this file must fail before the helper exists.
```
The module docstring mentions a “RED phase” expectation that tests must fail before the helper exists, which will be misleading once merged (and isn’t actionable for future maintainers). Consider removing/updating this to describe the behavioral contract being tested rather than the development phase.
```diff
-RED phase: all tests in this file must fail before the helper exists.
+These tests document that Polars round-trips and joins can widen Arrow field
+nullability, and verify that restore_schema_nullability reapplies the original
+schema's nullable=False flags where appropriate.
```
Fixed in d4bcd72 — removed the RED phase phrasing; docstring now describes the behavioral contract being validated.
```python
Integration tests: FunctionNode and Join preserve non-nullable column constraints
after the Arrow → Polars → Arrow round-trip that occurs during joins.

RED phase: tests should fail before the fix is applied.
```
The docstring states these tests are “RED phase” and should fail before the fix is applied; after merge this becomes stale and potentially confusing. Consider removing that phrasing and focusing the docstring on the nullability guarantees being validated.
```diff
-Integration tests: FunctionNode and Join preserve non-nullable column constraints
-after the Arrow → Polars → Arrow round-trip that occurs during joins.
-
-RED phase: tests should fail before the fix is applied.
+Integration tests validating that FunctionNode and Join preserve non-nullable
+column constraints across the Arrow → Polars → Arrow round-trip that occurs
+during joins and record retrieval.
```
Fixed in d4bcd72 — same update: RED phase line removed, docstring now describes the nullability guarantees being tested.
- join.py: replace Polars rename round-trip with Arrow-native rename_columns(), eliminating an unnecessary Arrow→Polars→Arrow conversion before the join.
- arrow_utils.py: add explicit else/TypeError in add_source_info so callers passing an unsized iterable (e.g. a generator) get a clear error instead of an UnboundLocalError from base_source being unbound.
- test_function_node_nullability.py: remove stale "RED phase" phrasing from module docstring; replace with a description of the behavioral contract being validated.
- test_restore_schema_nullability.py: same stale docstring fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Second review round addressed in commit d4bcd72:

Polars rename round-trip (join.py): Replaced the Arrow→Polars→Arrow rename with the Arrow-native rename_columns call, so no round-trip happens before the join.

Stale "RED phase" docstrings (both test files): Removed the development-phase phrasing from both module docstrings. Both now describe the behavioral contract being validated rather than the TDD workflow used to write them.
Resolve conflicts in function_node.py between our restore_schema_nullability additions and dev's _filter_by_content_hash calls. In each conflict site, apply the content-hash filter first (dev), then capture the schemas for restore_schema_nullability (ours). The first conflict also absorbs dev's rename of the join result variable from `joined` to `filtered` in _get_cached_results_for_entries.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary

- Polars outputs all fields as `nullable=True` regardless of original schema, corrupting schema intent for `int` (non-optional) vs `Optional[int]` columns
- Added `arrow_utils.restore_schema_nullability()` — looks up each field by name in reference schemas captured before the Polars operation, and reinstates the original `nullable` flag. This is schema-intent-based, not data-based.
- Removed the `infer_schema_nullable` workaround in `Join.static_process` (which was data-based and would incorrectly mark `Optional[T]` columns as non-nullable when they happen to contain no nulls)
- Applied at all Polars round-trip sites in `FunctionNode` and the join loop in `Join`

Why not `infer_schema_nullable`?

`infer_schema_nullable` derives `nullable` from whether the data actually contains nulls. An `Optional[int]` column with no current null values would be marked `nullable=False` — silently wrong. `restore_schema_nullability` uses the Arrow schemas from before the Polars conversion, which encode the user's type annotations.

Test plan

- Unit tests for `restore_schema_nullability` (6 cases), including the `Optional[T]`-with-no-nulls case that `infer_schema_nullable` fails
- Integration tests for `FunctionNode.get_all_records`, `FunctionNode._iter_all_from_database`, `Join.op_forward` with both non-nullable and nullable columns
- `test_join_preserves_nullable_optional_column_with_no_nulls` specifically catches the `infer_schema_nullable` regression

Closes ENG-375
🤖 Generated with Claude Code